Note:
Upgrade the matplotlib version form 3.2.2 to 3.5.3
Use command
!pip install matplotlib --upgrade
This business case is about one of the most popular media and video streaming platforms. They have over 10000 movies or TV shows available on their platform, as of mid-2021, they have over 222M Subscribers globally. They are planning to launch new shows based on data driven analysis. The business has provided us a tabular dataset consisting of listings of all the movies and TV shows available on the platform, along with details such as - cast, directors, ratings, release year, duration, etc.
By analysis data related to all the shows on the platform, it is interested in generating insights that could help them in deciding which type of shows/movies to produce and how they can grow the business in different countries. It is mainly interested in knowing the tailor maid strategies to grow their business in different countries, also to understand what type of shows has worked for them.
This study is based on the assumption that all the shows that are listed in the dataset has good user response and that is why they are still on the platform. It has presence in almost all the countries. However, this study will focus mainly on the countries having most number of shows, since the platform already has good presence in those countries and we have a good amount of data to understand user preference.
The shows on the platform are of two types: Movies and TV Shows. It is generally observed that the users, those prefer Movies are different that those who prefer TV Shows. Some users prefer shorter format of shows to be watched over series of episodes, while some prefer spending some more time and watching a Movie. The users also has their own favorites when it comes to actors and directors. The directors and actors working in Movies are generally different from TV Shows. Therefore, in this study we will separate the data with respect to show type and analyze each variable to understand user preferences.
We will carry out the following steps:
Importing Required Libraries
# !pip install matplotlib==3.5.3
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator)
import seaborn as sns
import textwrap
Download the DataSet
df = pd.read_csv('/content/Streaming_Data.csv')
df.head()
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 8807 entries, 0 to 8806 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 show_id 8807 non-null object 1 type 8807 non-null object 2 title 8807 non-null object 3 director 6173 non-null object 4 cast 7982 non-null object 5 country 7976 non-null object 6 date_added 8797 non-null object 7 release_year 8807 non-null int64 8 rating 8803 non-null object 9 duration 8804 non-null object 10 listed_in 8807 non-null object 11 description 8807 non-null object dtypes: int64(1), object(11) memory usage: 825.8+ KB
Shape:
12 columns (attributes)8807 data pointsData Type:
release_year is of int64 DtypeAll the remaining attributes are of object Dtype. release_year): Missing value detection, statistical summary¶df.describe()
| release_year | |
|---|---|
| count | 8807.000000 |
| mean | 2014.180198 |
| std | 8.819312 |
| min | 1925.000000 |
| 25% | 2013.000000 |
| 50% | 2017.000000 |
| 75% | 2019.000000 |
| max | 2021.000000 |
df['release_year'].isna().sum()
0
Missing Values:
Statistical Summary:
1925 to 202150% of the data is of the shows that are released from 2017 till 2021 (~5 yrs)25% of the data is of the shows that are released from 2013 till 2017 (~4 yrs)25% of the data is of the shows that are released from 1925 till 2013 (~88 yrs)Data Cleaning:
# Striping the values in Non-Numeric columns
df1 = df.drop('release_year', axis=1).apply(lambda x: x.str.rstrip()).apply(lambda x: x.str.lstrip())
# Assign to NaN
df1 = df1.apply(lambda series_: series_.apply(lambda value_: np.nan if value_=='' else value_))
# Update the DF
df[df1.columns] = df1
del df1
df.head()
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
Missing Values
df.isna().sum()
show_id 0 type 0 title 0 director 2634 cast 825 country 831 date_added 10 release_year 0 rating 4 duration 3 listed_in 0 description 0 dtype: int64
No missing values in columns
Some columns have missing values
Statistical Summary
df.describe(include=object)
| show_id | type | title | director | cast | country | date_added | rating | duration | listed_in | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8807 | 8807 | 8807 | 6173 | 7982 | 7976 | 8797 | 8803 | 8804 | 8807 | 8807 |
| unique | 8807 | 2 | 8806 | 4528 | 7692 | 748 | 1714 | 17 | 220 | 514 | 8775 |
| top | s1 | Movie | Consequences | Rajiv Chilaka | David Attenborough | United States | January 1, 2020 | TV-MA | 1 Season | Dramas, International Movies | Paranormal activity at a lush, abandoned prope... |
| freq | 1 | 6131 | 2 | 19 | 19 | 2818 | 110 | 3207 | 1793 | 362 | 4 |
show_id:
type:
title:
director:
cast:
country:
date_added:
rating:
duration:
listed_in:
description:
Conversion of Categorical Attributes to 'category'
# Convert show type to category Dtype
df['type'] = df['type'].astype('category')
# Combine ratings having same meaning
df.loc[df['rating'].isin(['TV-MA', 'R', 'NC-17']), 'rating'] = '17+'
df.loc[df['rating'].isin(['G', 'TV-G']), 'rating'] = 'General'
df.loc[df['rating'].isin(['NR', 'UR']), 'rating'] = 'Not Rated'
df.loc[df['rating']=='TV-Y', 'rating'] = 'All Kids'
df.loc[df['rating'].isin(['TV-PG', 'PG']), 'rating'] = 'Parental Guidance'
df.loc[df['rating']=='PG-13', 'rating'] = '13+'
df.loc[df['rating']=='TV-14', 'rating'] = '14+'
df.loc[df['rating'].isin(['TV-Y7', 'TV-Y7-FV']), 'rating'] = '7+'
# Convert rating column Dtype to category
df['rating'] = df['rating'].astype('category')
To find the best guess for the missing values, I tried to creat diffrent groups and observed for any pattern but I did not find any pattern.
Assign missing director, cast and country values as Unknown
df[['director', 'cast', 'country']] = df[['director', 'cast', 'country']].fillna('Unknown')
Convert the date_added column Dtype to datetime
To find the best guess for the missing date_added, I tried to creat diffrent groups and observed if ther is any pattern/ correlation with release date, type, directory, cast and country; but I could not find any.
Since missing date_added values are less in numbers (10). I would keep it as NaN.
Check if the date_added is before the release_year of any show. If it is so, it is a wrong information. We are not sure if the release_year or date_added is correct. Therefore for such records we make those values NaN.
df['date_added'] = pd.to_datetime(df['date_added'])
df.loc[df['date_added'].dt.year < df['release_year'], ['date_added', 'release_year']] = np.nan
df.loc[df['duration'].isna()]
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5541 | s5542 | Movie | Louis C.K. 2017 | Louis C.K. | Louis C.K. | United States | 2017-04-04 | 2017.0 | 74 min | NaN | Movies | Louis C.K. muses on religion, eternal love, gi... |
| 5794 | s5795 | Movie | Louis C.K.: Hilarious | Louis C.K. | Louis C.K. | United States | 2016-09-16 | 2010.0 | 84 min | NaN | Movies | Emmy-winning comedy writer Louis C.K. brings h... |
| 5813 | s5814 | Movie | Louis C.K.: Live at the Comedy Store | Louis C.K. | Louis C.K. | United States | 2016-08-15 | 2015.0 | 66 min | NaN | Movies | The comic puts his trademark hilarious/thought... |
Duration values of the shows are shifted to rating columns. Replace the duration from rating column to duration column
df.loc[df['duration'].isna(), ['rating', 'duration']] = df.loc[df['duration'].isna(), ['duration', 'rating']].values
Recalculate rating categories
df['rating'] = df['rating'].astype('string')
df['rating'] = df['rating'].astype('category')
Make the duration units uniform
For type: Movie
check for different units used
df.loc[df['type']=='Movie', 'duration'].str.split(' ').apply(lambda x: '-'.join([i for i in x if i.isalpha()])).unique()
array(['min'], dtype=object)
All the durations of type Movie are in min
Check if all the values in the durtion column contains only one numeric value
df.loc[df['type']=='Movie', 'duration'].str.split(' ').apply(lambda x: len([i for i in x if i.isnumeric()])==1).all()
True
Keep only the numeric values
df.loc[df['type']=='Movie', 'duration'] = df.loc[df['type']=='Movie', 'duration'].str.split(' ').apply(lambda x: [i for i in x if i.isnumeric()][0])
For type: TV Show
check for different units used
df.loc[df['type']=='TV Show', 'duration'].str.split(' ').apply(lambda x: '-'.join([i for i in x if i.isalpha()])).unique()
array(['Seasons', 'Season'], dtype=object)
All the durations of type TV Show are in either Season for 1 season and Seasons for multiple seasons
Check if all the values in the durtion column contains only one numeric value
df.loc[df['type']=='TV Show', 'duration'].str.split(' ').apply(lambda x: len([i for i in x if i.isnumeric()])==1).all()
True
Keep only the numeric values
df.loc[df['type']=='TV Show', 'duration'] = df.loc[df['type']=='TV Show', 'duration'].str.split(' ').apply(lambda x: [i for i in x if i.isnumeric()][0])
Convert the Dtype to integet
df['duration'] = df['duration'].astype('int')
Assign Missing Rating values to Not Rated Category
df.loc[df['rating'].isna(), 'rating'] = 'Not Rated'
df.isna().sum()
show_id 0 type 0 title 0 director 0 cast 0 country 0 date_added 24 release_year 14 rating 0 duration 0 listed_in 0 description 0 dtype: int64
As discussed, all the missing values are treated. As discussed there are some missing values in date_added and release_year columns.
df.groupby('type').describe()
| release_year | duration | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
| type | ||||||||||||||||
| Movie | 6129.0 | 2013.119759 | 9.679256 | 1942.0 | 2012.0 | 2016.0 | 2018.0 | 2021.0 | 6131.0 | 99.564998 | 28.289504 | 3.0 | 87.0 | 98.0 | 114.0 | 312.0 |
| TV Show | 2664.0 | 2016.593468 | 5.749193 | 1925.0 | 2016.0 | 2018.0 | 2020.0 | 2021.0 | 2676.0 | 1.764948 | 1.582752 | 1.0 | 1.0 | 1.0 | 2.0 | 17.0 |
df.groupby('type').describe(include=np.datetime64, datetime_is_numeric=True)
| date_added | |||||||
|---|---|---|---|---|---|---|---|
| count | mean | min | 25% | 50% | 75% | max | |
| type | |||||||
| Movie | 6129 | 2019-05-07 06:56:47.929515264 | 2008-01-01 | 2018-04-01 00:00:00 | 2019-06-20 00:00:00 | 2020-07-24 | 2021-09-25 |
| TV Show | 2654 | 2019-06-10 13:43:05.380557568 | 2008-02-04 | 2018-04-21 18:00:00 | 2019-08-15 12:00:00 | 2020-10-01 | 2021-09-24 |
df.groupby('type').describe(include='category')
| rating | ||||
|---|---|---|---|---|
| count | unique | top | freq | |
| type | ||||
| Movie | 6131 | 8 | 17+ | 2862 |
| TV Show | 2676 | 7 | 17+ | 1147 |
df.loc[df['director']!='Unknown', 'director'].describe(include=object)
count 6173 unique 4528 top Rajiv Chilaka freq 19 Name: director, dtype: object
df.loc[df['cast']!='Unknown', 'cast'].describe(include=object)
count 7982 unique 7692 top David Attenborough freq 19 Name: cast, dtype: object
df.loc[df['country']!='Unknown', 'country'].describe(include=object)
count 7976 unique 748 top United States freq 2818 Name: country, dtype: object
df.loc[df['listed_in']!='Unknown', 'listed_in'].describe(include=object)
count 8807 unique 514 top Dramas, International Movies freq 362 Name: listed_in, dtype: object
Check Movies with title 'Consequences' as there are two movies with same title
df.loc[df['title']=='Consequences']
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3371 | s3372 | Movie | Consequences | Ozan Açıktan | Nehir Erdoğan, Tardu Flordun, İlker Kaleli, Se... | Turkey | 2019-10-25 | 2014.0 | 17+ | 106 | Dramas, International Movies, Thrillers | Secrets bubble to the surface after a sensual ... |
| 6529 | s6530 | Movie | Consequences | Ozan Açıktan | Nehir Erdoğan, Tardu Flordun, İlker Kaleli, Se... | Turkey | 2019-10-25 | 2014.0 | 17+ | 106 | Dramas, International Movies, Thrillers | Secrets bubble to the surface after a sensual ... |
Remove the duplicate movie record
df = df.drop_duplicates(subset='title')
df['type'].value_counts()
Movie 6130 TV Show 2676 Name: type, dtype: int64
(df['type'].value_counts()/ df.shape[0] * 100).round(2)
Movie 69.61 TV Show 30.39 Name: type, dtype: float64
df.groupby('type')[['show_id', 'title']].nunique()
| show_id | title | |
|---|---|---|
| type | ||
| Movie | 6130 | 6130 |
| TV Show | 2676 | 2676 |
Columns show_id and title are unique for each show
There are 70% Movies (6130) and 30% TV Shows (2676)
df.loc[df['type']=='Movie', 'director'].value_counts()
Unknown 188
Rajiv Chilaka 19
Raúl Campos, Jan Suter 18
Suhas Kadav 16
Marcus Raboy 15
...
Sion Sono 1
Dennis Rovira van Boekholt 1
Naoto Amazutsumi 1
Jenny Gage 1
Mozez Singh 1
Name: director, Length: 4355, dtype: int64
df.loc[df['type']=='TV Show', 'director'].value_counts()
Unknown 2446
Alastair Fothergill 3
Hsu Fu-chun 2
Iginio Straffi 2
Shin Won-ho 2
...
Juliana Vicente 1
Chang Chin-jung, Chen Rong-hui 1
Thierry Demaizière, Alban Teurlai 1
Manolo Caro 1
Michael Cumming 1
Name: director, Length: 223, dtype: int64
df.loc[df['type']=='Movie', 'cast'].value_counts()
Unknown 475
Vatsal Dubey, Julie Tejwani, Rupa Bhimani, Jigna Bhardwaj, Rajesh Kava, Mousam, Swapnil 13
Samuel West 10
Jeff Dunham 7
Kevin Hart 6
...
Hánssel Casillas, Loreto Peralta, Andrea Sutton, Luis De La Rosa, Alejandro Flores, Anajosé Aldrete, Daniel Haddad, Montserrat Marañón, Juan Carlos Viana Prieto, Hernán Del Riego 1
Oscar Martínez, Inma Cuesta, Mafalda Carbonell, Nacho López, Isabel Requena, Aina Clotet, Antonio Valero, María Zamora, Hugo Balaguer, Valeria Schoneveld 1
Casey Affleck, Rooney Mara, Liz Franke, Rob Zabrecky, Will Oldham, Sonia Acevedo 1
Faizon Love, Kirk Fox, Tony Todd, Pearl Thusi, Cassie Clare, Fiona Ramsey, Neels Clasen, Roxy Nel 1
Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanana, Manish Chaudhary, Meghna Malik, Malkeet Rauni, Anita Shabdish, Chittaranjan Tripathy 1
Name: cast, Length: 5446, dtype: int64
df.loc[df['type']=='TV Show', 'cast'].value_counts()
Unknown 350
David Attenborough 14
Michela Luci, Jamie Watson, Anna Claire Bartlam, Dante Zee, Eric Peterson 4
Dave Chappelle 3
You, Reina Triendl, Yoshimi Tokui, Azusa Babazono, Ryota Yamasato, Shono Hayama 2
...
Vir Das, Ranvir Shorey, Manoj Pahwa, Ravi Kishan, Raza Murad, Inaamulhaq, Suhail Nayyar, Amrita Bagchi 1
Sebastián Rulli, Renata Notni, Roberto Mateos, Irina Baeva, Cassandra Sánchez Navarro, Manuel Balbi, Javier Gómez, Alejandro Ávila, Sofía Castro, Juan Pablo Gil 1
Kenya Barris, Rashida Jones, Iman Benson, Genneya Walton, Scarlet Spencer, Justin Claiborne, Ravi Cabot-Conyers 1
Lee Min-ho, Kim Go-eun, Woo Do‑hwan, Kim Kyung-nam, Jung Eun-chae, Lee Jung-jin, Kim Young-ok 1
Sanam Saeed, Fawad Khan, Ayesha Omer, Mehreen Raheel, Sheheryar Munawar, Samina Peerzada, Waseem Abbas, Javed Sheikh, Hina Khawaja Bayat 1
Name: cast, Length: 2284, dtype: int64
df.loc[df['type']=='Movie', 'country'].value_counts()
United States 2058
India 893
Unknown 440
United Kingdom 206
Canada 122
...
United Kingdom, Russia, United States 1
Paraguay, Argentina 1
United Kingdom, Malawi 1
Austria, Iraq, United States 1
United Arab Emirates, Jordan 1
Name: country, Length: 652, dtype: int64
df.loc[df['type']=='TV Show', 'country'].value_counts()
United States 760
Unknown 391
United Kingdom 213
Japan 169
South Korea 158
...
Belarus 1
United Kingdom, Australia 1
France, Australia, Germany 1
Australia, New Zealand, United States 1
United States, France, South Korea, Indonesia 1
Name: country, Length: 197, dtype: int64
df.loc[df['type']=='Movie', 'rating'].value_counts()
17+ 2861 14+ 1427 Parental Guidance 827 13+ 490 General 167 7+ 144 All Kids 131 Not Rated 83 Name: rating, dtype: int64
df.loc[df['type']=='TV Show', 'rating'].value_counts()
17+ 1147 14+ 733 Parental Guidance 323 7+ 196 All Kids 176 General 94 Not Rated 7 13+ 0 Name: rating, dtype: int64
df.loc[df['type']=='Movie', 'listed_in'].value_counts()
Dramas, International Movies 362
Documentaries 359
Stand-Up Comedy 334
Comedies, Dramas, International Movies 274
Dramas, Independent Movies, International Movies 252
...
Sci-Fi & Fantasy 1
Sports Movies 1
Children & Family Movies, Comedies, Cult Movies 1
Cult Movies, Dramas, Music & Musicals 1
Cult Movies, Dramas, Thrillers 1
Name: listed_in, Length: 278, dtype: int64
df.loc[df['type']=='TV Show', 'listed_in'].value_counts()
Kids' TV 220
International TV Shows, TV Dramas 121
Crime TV Shows, International TV Shows, TV Dramas 110
Kids' TV, TV Comedies 99
Reality TV 95
...
Kids' TV, TV Action & Adventure, TV Dramas 1
British TV Shows, Kids' TV, TV Thrillers 1
Reality TV, TV Horror, TV Thrillers 1
TV Action & Adventure, TV Horror, TV Sci-Fi & Fantasy 1
Classic & Cult TV, Crime TV Shows, TV Dramas 1
Name: listed_in, Length: 236, dtype: int64
df.loc[df['type']=='Movie', 'date_added'].describe(datetime_is_numeric=True)
count 6128 mean 2019-05-07 06:16:41.044386560 min 2008-01-01 00:00:00 25% 2018-04-01 00:00:00 50% 2019-06-19 12:00:00 75% 2020-07-24 00:00:00 max 2021-09-25 00:00:00 Name: date_added, dtype: object
df.loc[df['type']=='TV Show', 'date_added'].describe(datetime_is_numeric=True)
count 2654 mean 2019-06-10 13:43:05.380557568 min 2008-02-04 00:00:00 25% 2018-04-21 18:00:00 50% 2019-08-15 12:00:00 75% 2020-10-01 00:00:00 max 2021-09-24 00:00:00 Name: date_added, dtype: object
df.loc[df['type']=='Movie', 'release_year'].describe().round()
count 6128.0 mean 2013.0 std 10.0 min 1942.0 25% 2012.0 50% 2016.0 75% 2018.0 max 2021.0 Name: release_year, dtype: float64
df.loc[df['type']=='TV Show', 'release_year'].describe().round()
count 2664.0 mean 2017.0 std 6.0 min 1925.0 25% 2016.0 50% 2018.0 75% 2020.0 max 2021.0 Name: release_year, dtype: float64
df.loc[df['type']=='Movie', 'duration'].describe().round()
count 6130.0 mean 100.0 std 28.0 min 3.0 25% 87.0 50% 98.0 75% 114.0 max 312.0 Name: duration, dtype: float64
df.loc[df['type']=='TV Show', 'duration'].describe().round()
count 2676.0 mean 2.0 std 2.0 min 1.0 25% 1.0 50% 1.0 75% 2.0 max 17.0 Name: duration, dtype: float64
Define a Function to Split Nested Data Column into Rows
df.head()
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | Unknown | United States | 2021-09-25 | 2020.0 | 13+ | 90 | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | Unknown | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | 2021-09-24 | 2021.0 | 17+ | 2 | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | Unknown | 2021-09-24 | 2021.0 | 17+ | 1 | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | Unknown | Unknown | Unknown | 2021-09-24 | 2021.0 | 17+ | 1 | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | Unknown | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | 2021-09-24 | 2021.0 | 17+ | 2 | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
def unnested_column(df, id_col, nested_col):
df1 = pd.concat((df[id_col], df[nested_col].str.split(',')), axis=1).explode(nested_col)
''' Alternate approach
df1 = pd.concat((df[id_col], df[nested_col].str.split(',', expand=True)), axis=1)
df1 = pd.melt(df1.reset_index(), id_vars=['index', id_col], value_name=nested_col).dropna().sort_values('index').drop(['index', 'variable'], axis=1)
'''
df1[nested_col] = df1[nested_col].str.lstrip().str.rstrip()
df = df.drop(nested_col, axis=1)
return [df, df1]
Unnest director, cast, country and listed_in Columns
# df, df_listed_in = unnested_column(df, 'show_id', 'listed_in')
# df, df_director = unnested_column(df, 'show_id', 'director')
# df, df_cast = unnested_column(df, 'show_id', 'cast')
# df, df_country = unnested_column(df, 'show_id', 'country')
Merge the unnested columns with the main df
# df = pd.merge(df, df_director, on='show_id')
# df = pd.merge(df, df_cast, on='show_id')
# df = pd.merge(df, df_country, on='show_id')
# df = pd.merge(df, df_listed_in, on='show_id')
# del df_director, df_cast, df_country, df_listed_in
Alternate and efficient approach to get unnested columns and then melt is to use split explode
def unnest_column_approach01(df_, id_col, nested_col):
df_ = pd.concat((df_.drop(nested_col, axis=1), df_[nested_col].str.split(',')), axis=1).explode(nested_col)
df_[nested_col] = df_[nested_col].str.lstrip().str.rstrip()
return df_.reset_index(drop=True)
df = unnest_column_approach01(df, 'show_id', 'listed_in')
df = unnest_column_approach01(df, 'show_id', 'director')
df = unnest_column_approach01(df, 'show_id', 'cast')
df = unnest_column_approach01(df, 'show_id', 'country')
Rearrange the Columns
df.head()
| show_id | type | title | date_added | release_year | rating | duration | description | listed_in | director | cast | country | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | 2021-09-25 | 2020.0 | 13+ | 90 | As her father nears the end of his life, filmm... | Documentaries | Kirsten Johnson | Unknown | United States |
| 1 | s2 | TV Show | Blood & Water | 2021-09-24 | 2021.0 | 17+ | 2 | After crossing paths at a party, a Cape Town t... | International TV Shows | Unknown | Ama Qamata | South Africa |
| 2 | s2 | TV Show | Blood & Water | 2021-09-24 | 2021.0 | 17+ | 2 | After crossing paths at a party, a Cape Town t... | International TV Shows | Unknown | Khosi Ngema | South Africa |
| 3 | s2 | TV Show | Blood & Water | 2021-09-24 | 2021.0 | 17+ | 2 | After crossing paths at a party, a Cape Town t... | International TV Shows | Unknown | Gail Mabalane | South Africa |
| 4 | s2 | TV Show | Blood & Water | 2021-09-24 | 2021.0 | 17+ | 2 | After crossing paths at a party, a Cape Town t... | International TV Shows | Unknown | Thabang Molaba | South Africa |
df = df[['show_id', 'type', 'title', 'director', 'cast', 'country', 'listed_in', 'duration', 'rating', 'date_added', 'release_year', 'description']]
df.head(10)
| show_id | type | title | director | cast | country | listed_in | duration | rating | date_added | release_year | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | Unknown | United States | Documentaries | 90 | 13+ | 2021-09-25 | 2020.0 | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | Unknown | Ama Qamata | South Africa | International TV Shows | 2 | 17+ | 2021-09-24 | 2021.0 | After crossing paths at a party, a Cape Town t... |
| 2 | s2 | TV Show | Blood & Water | Unknown | Khosi Ngema | South Africa | International TV Shows | 2 | 17+ | 2021-09-24 | 2021.0 | After crossing paths at a party, a Cape Town t... |
| 3 | s2 | TV Show | Blood & Water | Unknown | Gail Mabalane | South Africa | International TV Shows | 2 | 17+ | 2021-09-24 | 2021.0 | After crossing paths at a party, a Cape Town t... |
| 4 | s2 | TV Show | Blood & Water | Unknown | Thabang Molaba | South Africa | International TV Shows | 2 | 17+ | 2021-09-24 | 2021.0 | After crossing paths at a party, a Cape Town t... |
| 5 | s2 | TV Show | Blood & Water | Unknown | Dillon Windvogel | South Africa | International TV Shows | 2 | 17+ | 2021-09-24 | 2021.0 | After crossing paths at a party, a Cape Town t... |
| 6 | s2 | TV Show | Blood & Water | Unknown | Natasha Thahane | South Africa | International TV Shows | 2 | 17+ | 2021-09-24 | 2021.0 | After crossing paths at a party, a Cape Town t... |
| 7 | s2 | TV Show | Blood & Water | Unknown | Arno Greeff | South Africa | International TV Shows | 2 | 17+ | 2021-09-24 | 2021.0 | After crossing paths at a party, a Cape Town t... |
| 8 | s2 | TV Show | Blood & Water | Unknown | Xolile Tshabalala | South Africa | International TV Shows | 2 | 17+ | 2021-09-24 | 2021.0 | After crossing paths at a party, a Cape Town t... |
| 9 | s2 | TV Show | Blood & Water | Unknown | Getmore Sithole | South Africa | International TV Shows | 2 | 17+ | 2021-09-24 | 2021.0 | After crossing paths at a party, a Cape Town t... |
Define a function to Dist Plot and Count Plot by Type of Show
def plt_dist_plot(ax_, df_, var_, title_, xlabel_, ylabel_):
df_sub_count = df_[['type', var_]].value_counts()
df_sub_count.name = xlabel_
df_sub_count = df_sub_count.reset_index()
sns.kdeplot(data=df_sub_count, x=xlabel_, ax=ax_,
hue='type', hue_order=['TV Show', 'Movie'], lw=4, palette=['#3A5BA0', '#DC3535'])
plt.sca(ax_)
plt.title(title_, size=16, color='grey')
plt.xlabel(xlabel_, size=12)
plt.ylabel(ylabel_, size=12)
plt.xticks(size=12)
plt.yticks(size=12)
plt.legend(labels=['Movie', 'TV Show'], loc='best', fontsize=12)
sns.despine(right=True, top=True, ax=ax_)
ax_.spines['left'].set_color('grey')
ax_.spines['bottom'].set_color('grey')
ax_.figure.set_size_inches(16,8)
ax_.figure.subplots_adjust(top=0.81,right=0.86)
return plt
def plt_most10_count_plot(ax_, df_, var_, show_type, title_, independant_label_, dependant_label_, unit_='', highlight_num=3, highlight_color='#DC3535', orientation_='v'):
list_most10 = df_.loc[df_['type']==show_type, var_].value_counts().iloc[:10].index
df_ = df_.loc[(df_['type']==show_type) & df_[var_].isin(list_most10), [var_]]
custom_palette = [highlight_color for i in range(highlight_num)] + ['grey' for i in range(10-highlight_num)]
if orientation_ == 'v':
sns.countplot(data=df_, x=var_, order=list_most10, ax=ax_,
palette = custom_palette)
plt.sca(ax_)
plt.xlabel(independant_label_)
plt.ylabel(dependant_label_)
plt.tick_params(
axis='y', # changes apply to the y-axis
which='both', # both major and minor ticks are affected
left=False, # ticks along the left edge are off
right=False, # ticks along the right edge are off
labelleft=False) # labels along the left edge are off
labels = []
for label in ax_.get_xticklabels():
text = label.get_text()
labels.append(textwrap.fill(text, width=10,
break_long_words=False))
ax_.set_xticklabels(labels, rotation=0)
bar_labels=[]
for container in ax_.containers:
for rect in container:
# Rectangle widths are already integer-valued but are floating
# type, so it helps to remove the trailing decimal point and 0 by
# converting width to int type
# Shift the text to the left side of the right edge
yloc = 4
# White on magenta
clr = 'white'
align = 'bottom'
rotation_ = 90
label_text = '' if not rect.get_height() else round(rect.get_height()) if rect.get_height() > 0 and rect.get_height() < 1000 else f'{rect.get_height()/1000: .1f} k'
label_text = f'{label_text} {unit_}' if label_text and rect.get_height()/container[0].get_height() > 0.05+0.022*len(unit_) else label_text if rect.get_height()/container[0].get_height() > 0.05 else ''
size_ = 12 if rect.get_height()/container[0].get_height() > 0.1 else 9 if rect.get_height()/container[0].get_height() > 0.06 else 8 if rect.get_height()/container[0].get_height() > 0.055 else 7
# Center the text vertically in the bar
xloc = rect.get_x() + rect.get_width() / 2
ax_.annotate(label_text, xy=(xloc, 0), xytext=(0, yloc),
textcoords="offset points", size=size_,
ha='center', va=align, rotation=rotation_,
color=clr, clip_on=True)
else:
sns.countplot(data=df_, y=var_, order=list_most10, ax=ax_,
palette = custom_palette)
plt.sca(ax_)
plt.xlabel(dependant_label_)
plt.ylabel(independant_label_)
plt.tick_params(
axis='x', # changes apply to the x-axis
which='both', # both major and minor ticks are affected
bottom=False, # ticks along the bottom edge are off
top=False, # ticks along the top edge are off
labelbottom=False) # labels along the bottom edge are off
labels = []
for label in ax_.get_yticklabels():
text = label.get_text()
labels.append(textwrap.fill(text, width=15,
break_long_words=False))
ax_.set_yticklabels(labels, rotation=0)
bar_labels=[]
for container in ax_.containers:
for rect in container:
# Rectangle widths are already integer-valued but are floating
# type, so it helps to remove the trailing decimal point and 0 by
# converting width to int type
# Shift the text to the left side of the right edge
xloc = 2
# White on magenta
clr = 'white'
align = 'left'
label_text = '' if not rect.get_width() else round(rect.get_width()) if rect.get_width() > 0 and rect.get_width() < 1000 else f'{rect.get_width()/1000: .1f} k'
label_text = f'{label_text} {unit_}' if label_text and rect.get_width()/container[0].get_width() > 0.05+0.022*len(unit_) else label_text if rect.get_width()/container[0].get_width() > 0.05 else ''
size_ = 12 if rect.get_width()/container[0].get_width() > 0.1 else 9 if rect.get_width()/container[0].get_width() > 0.06 else 8 if rect.get_width()/container[0].get_width() > 0.055 else 7
# Center the text vertically in the bar
yloc = rect.get_y() + rect.get_height() / 2
ax_.annotate(label_text, xy=(0, yloc), xytext=(xloc, 0),
textcoords="offset points", size=size_,
ha=align, va='center',
color=clr, clip_on=True)
sns.despine(left=True, bottom=True, ax=ax_)
plt.title(title_, size=16, color='grey')
plt.xticks(size=12)
plt.yticks(size=12)
ax_.spines['left'].set_color('grey')
ax_.spines['bottom'].set_color('grey')
ax_.figure.set_size_inches(16,8)
ax_.figure.subplots_adjust(top=0.81,right=0.86)
return plt
Director by Type of Show
df_sub = df[['show_id', 'type', 'director']].drop(df.loc[df['director']=='Unknown'].index).drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='director', title_='Shows per Director Count', xlabel_='Shows per Director', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='director', show_type='Movie', title_='Most Watched Directors', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='director', show_type='TV Show', title_='Most Watched Directors', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=2, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
Cast by Type of Show
df_sub = df[['show_id', 'type', 'cast']].drop(df.loc[df['cast']=='Unknown'].index).drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='cast', title_='Shows per Actor Count', xlabel_='Shows per Actor', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='cast', show_type='Movie', title_='Most Watched Actors', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=5, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='cast', show_type='TV Show', title_='Most Watched Actors', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=5, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
Actor & Director by Type of Show
df_sub = df[['show_id', 'type', 'director', 'cast']].drop(df.loc[(df['director']=='Unknown')|(df['cast']=='Unknown')].index).drop_duplicates()
df_sub['actor_director'] = df['cast'] + ' & ' + df['director']
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='actor_director', title_='Shows per Actor & Director Count', xlabel_='Shows per Actor & Director', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='actor_director', show_type='Movie', title_='Most Watched Actor & Director', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=5, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='actor_director', show_type='TV Show', title_='Most Watched Actor & Director', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=7, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
Genre by Type of Show
df_sub = df[['show_id', 'type', 'listed_in']].drop(df.loc[df['listed_in']=='Unknown'].index).drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='listed_in', title_='Shows per Genre Count', xlabel_='Shows per Genre', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='listed_in', show_type='Movie', title_='Most Watched Genre', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='listed_in', show_type='TV Show', title_='Most Watched Genre', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
Rating by Type of Show
df_sub = df[['show_id', 'type', 'rating']].drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='rating', title_='Shows per Rating Count', xlabel_='Shows per Rating', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='rating', show_type='Movie', title_='Most Watched Rating', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='rating', show_type='TV Show', title_='Most Watched Rating', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
Duration by Type of Show
df_sub = df[['show_id', 'type', 'duration']].drop_duplicates()
df_sub.loc[df['type']=='Movie', 'duration'] = np.ceil(df_sub.loc[df['type']=='Movie', 'duration']/10)*10
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='duration', title_='Shows per Duration', xlabel_='Shows per Duration', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='duration', show_type='Movie', title_='Most Shows Duration', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='duration', show_type='TV Show', title_='Most Shows Duration', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
Date Added by Type of Show
df_sub = df[['show_id', 'type', 'date_added']].drop_duplicates()
df_sub['date_added'] = df_sub['date_added'].dt.year
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='date_added', title_='Shows per Date Added Count (Year)', xlabel_='Shows per Date Added (Year) ', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='date_added', show_type='Movie', title_='Most Shows Date Added ', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='date_added', show_type='TV Show', title_='Most Shows Date Added ', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
Release Year by Type of Show
df_sub = df[['show_id', 'type', 'release_year']].drop_duplicates()
# df_sub['date_added'] = df_sub['release_year'].dt.year
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='release_year', title_='Shows per Release Year', xlabel_='Shows per Release Year', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='release_year', show_type='Movie', title_='Most Shows Release Year', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=4, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='release_year', show_type='TV Show', title_='Most Shows Release Year', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=4, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
Country by Type of Show
df_sub = df[['show_id', 'type', 'country']].drop(df.loc[df['country']=='Unknown'].index).drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='country', title_='Shows per Country Count', xlabel_='Shows per Country', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='country', show_type='Movie', title_='Most Shows per Country', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=5, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='country', show_type='TV Show', title_='Most Shows per Country', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=5, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
def country_wise_analysis(country_):
print('Director by Type of Show')
df_sub = df[['show_id', 'country', 'type', 'director']].drop(df.loc[df['director']=='Unknown'].index).drop_duplicates()
df_sub = df_sub.loc[df_sub['country']==country_]
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='director', title_=f'Shows per Director Count ({country_})', xlabel_='Shows per Director', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='director', show_type='Movie', title_='Most Watched Directors', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='director', show_type='TV Show', title_='Most Watched Directors', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
# ------------------------------------------
print('Cast by Type of Show')
df_sub = df[['show_id', 'country', 'type', 'cast']].drop(df.loc[df['cast']=='Unknown'].index).drop_duplicates()
df_sub = df_sub.loc[df_sub['country']==country_]
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='cast', title_=f'Shows per Actor Count ({country_})', xlabel_='Shows per Actor', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='cast', show_type='Movie', title_='Most Watched Actors', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='cast', show_type='TV Show', title_='Most Watched Actors', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
# ------------------------------------------
print('Actor & Director by Type of Show')
df_sub = df[['show_id', 'country', 'type', 'director', 'cast']].drop(df.loc[(df['director']=='Unknown')|(df['cast']=='Unknown')].index).drop_duplicates()
df_sub = df_sub.loc[df_sub['country']==country_]
df_sub['actor_director'] = df['cast'] + ' & ' + df['director']
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='actor_director', title_=f'Shows per Actor & Director Count ({country_})', xlabel_='Shows per Actor & Director', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='actor_director', show_type='Movie', title_='Most Watched Actor & Director', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='actor_director', show_type='TV Show', title_='Most Watched Actor & Director', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
# ------------------------------------------
print('Genre by Type of Show')
df_sub = df[['show_id', 'country', 'type', 'listed_in']].drop(df.loc[df['listed_in']=='Unknown'].index).drop_duplicates()
df_sub = df_sub.loc[df_sub['country']==country_]
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='listed_in', title_=f'Shows per Genre Count ({country_})', xlabel_='Shows per Genre', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='listed_in', show_type='Movie', title_='Most Watched Genre', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='listed_in', show_type='TV Show', title_='Most Watched Genre', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
# ------------------------------------------
print('Rating by Type of Show')
df_sub = df[['show_id', 'country', 'type', 'rating']].drop_duplicates()
df_sub = df_sub.loc[df_sub['country']==country_]
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='rating', title_=f'Shows per Rating Count ({country_})', xlabel_='Shows per Rating', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='rating', show_type='Movie', title_='Most Watched Rating', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='rating', show_type='TV Show', title_='Most Watched Rating', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
# ------------------------------------------
print('Duration by Type of Show')
df_sub = df[['show_id', 'country', 'type', 'duration']].drop_duplicates()
df_sub = df_sub.loc[df_sub['country']==country_]
df_sub.loc[df['type']=='Movie', 'duration'] = np.ceil(df_sub.loc[df['type']=='Movie', 'duration']/10)*10
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='duration', title_=f'Shows per Duration ({country_})', xlabel_='Shows per Duration', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='duration', show_type='Movie', title_='Most Shows Duration', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='duration', show_type='TV Show', title_='Most Shows Duration', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
# ------------------------------------------
print('Date Added by Type of Show')
df_sub = df[['show_id', 'country', 'type', 'date_added']].drop_duplicates()
df_sub = df_sub.loc[df_sub['country']==country_]
df_sub['date_added'] = df_sub['date_added'].dt.year
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='date_added', title_=f'Shows per Date Added Count Yearly ({country_})', xlabel_='Shows per Date Added (Year) ', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='date_added', show_type='Movie', title_='Most Shows Date Added ', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='date_added', show_type='TV Show', title_='Most Shows Date Added ', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
# ------------------------------------------
print('Release Year by Type of Show')
df_sub = df[['show_id', 'country', 'type', 'release_year']].drop_duplicates()
df_sub = df_sub.loc[df_sub['country']==country_]
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']
plt_dist_plot(ax_=ax1, df_=df_sub, var_='release_year', title_=f'Shows per Release Year ({country_})', xlabel_='Shows per Release Year', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='release_year', show_type='Movie', title_='Most Shows Release Year', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='release_year', show_type='TV Show', title_='Most Shows Release Year', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')
fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)
plt.show()
country_wise_analysis('United States')
Director by Type of Show
Cast by Type of Show
Actor & Director by Type of Show
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show
country_wise_analysis('India')
Director by Type of Show
Cast by Type of Show
Actor & Director by Type of Show
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show
country_wise_analysis('United Kingdom')
Director by Type of Show
Cast by Type of Show
Actor & Director by Type of Show
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show
country_wise_analysis('Canada')
Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning. warnings.warn(msg, UserWarning)
Cast by Type of Show
Actor & Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning. warnings.warn(msg, UserWarning)
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show
country_wise_analysis('France')
Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning. warnings.warn(msg, UserWarning)
Cast by Type of Show
Actor & Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning. warnings.warn(msg, UserWarning)
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show
country_wise_analysis('Japan')
Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning. warnings.warn(msg, UserWarning)
Cast by Type of Show
Actor & Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning. warnings.warn(msg, UserWarning)
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show
country_wise_analysis('South Korea')
Director by Type of Show
Cast by Type of Show
Actor & Director by Type of Show
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show
def plot_hist_box(var_, title_, xlabel_, movie_x_major, movie_x_minor, tv_x_major, tv_x_minor, is_date=False):
df_ = df[['show_id', 'type', var_]].drop_duplicates()
if is_date:
df_[var_] = df_[var_].dt.year
# ----------------------------------------
m1 = df_.loc[df['type']=='Movie', var_].mean()
st1 = df_.loc[df['type']=='Movie', var_].std()
q1 = df_.loc[df['type']=='Movie', var_].quantile(.25)
q2 = df_.loc[df['type']=='Movie', var_].median()
q3 = df_.loc[df['type']=='Movie', var_].quantile(.75)
#
ax1 = plt.subplot(2,2,1)
sns.histplot(data=df_.loc[df['type']=='Movie'], x=var_,
binwidth=movie_x_minor,
color='#DC3535', ax=ax1, linewidth=2)
df_mean = pd.DataFrame({'x': [m1, m1], 'y': ax1.get_ybound()})
df_q1 = pd.DataFrame({'x': [q1, q1], 'y': ax1.get_ybound()})
df_q2 = pd.DataFrame({'x': [q2, q2], 'y': ax1.get_ybound()})
df_q3 = pd.DataFrame({'x': [q3, q3], 'y': ax1.get_ybound()})
sns.lineplot(data=df_mean, x='x', y='y', color='red', ax=ax1, linestyle='--',
estimator=None, linewidth = 2)
sns.lineplot(data=df_q1, x='x', y='y', color='black', ax=ax1, linestyle='--',
estimator=None, linewidth = 1)
sns.lineplot(data=df_q2, x='x', y='y', color='cyan', ax=ax1, linestyle='--',
estimator=None, linewidth = 2)
sns.lineplot(data=df_q3, x='x', y='y', color='black', ax=ax1, linestyle='--',
estimator=None, linewidth = 1)
plt.sca(ax1)
plt.title(f'{title_} for Movies', size=16, color='grey')
plt.tick_params(
axis='x', # changes apply to the x-axis
which='both', # both major and minor ticks are affected
bottom=False, # ticks along the bottom edge are off
top=False, # ticks along the top edge are off
labelbottom=False) # labels along the bottom edge are off
plt.xlabel('')
plt.ylabel('Count', size=12)
plt.yticks(size=12)
sns.despine(bottom=True, left=False, trim=True, ax=ax1)
ax1.spines['left'].set_color('grey')
ax1.xaxis.set_major_locator(MultipleLocator(movie_x_major))
ax1.xaxis.set_major_formatter('{x:.0f}')
ax1.xaxis.set_minor_locator(MultipleLocator(movie_x_minor))
#
df_mean['y'] = [-0.3, 0.2]
df_q1['y'] = [0.2, 0.25]
df_q2['y'] = [0.1, 0.25]
df_q3['y'] = [0.2, 0.25]
ax2 = plt.subplot(2,2,3, sharex=ax1)
sns.boxplot(data=df_.loc[df['type']=='Movie'], x=var_, ax=ax2,
color='#DC3535', showmeans=True,
# color='white',
flierprops={"marker": "x"}, medianprops={"color": "cyan"},
width=0.4, fliersize=1, linewidth=2, notch=True)
sns.lineplot(data=df_mean, x='x', y='y', color='red', ax=ax2, linestyle='--',
estimator=None, linewidth = 2)
text = f' μ={m1:.1f}\n σ={st1:.1f}'
ax2.annotate(text, xy=(m1, -0.3), rotation=90)
sns.lineplot(data=df_q1, x='x', y='y', color='black', ax=ax2, linestyle='--',
estimator=None, linewidth = 1)
text = f'Q1={q1:.1f} '
ax2.annotate(text, xy=(q1-0.1, 0.25), rotation=90, va='top', ha='right')
sns.lineplot(data=df_q2, x='x', y='y', color='cyan', ax=ax2, linestyle='--',
estimator=None, linewidth = 2)
text = f'med={q2:.1f} '
ax2.annotate(text, xy=(q2, 0.25), rotation=90, va='top', ha='center')
sns.lineplot(data=df_q3, x='x', y='y', color='black', ax=ax2, linestyle='--',
estimator=None, linewidth = 1)
text = f'Q3={q3:.1f} '
ax2.annotate(text, xy=(q3+0.1, 0.25), rotation=90, va='top', ha='left')
plt.sca(ax2)
plt.xlabel(xlabel_, size=12)
plt.ylabel('')
plt.xticks(size=12)
sns.despine(bottom=False, left=False, ax=ax2)
ax2.spines['bottom'].set_color('grey')
ax2.spines['left'].set_color('grey')
ax2.xaxis.set_major_locator(MultipleLocator(movie_x_major))
ax2.xaxis.set_major_formatter('{x:.0f}')
ax2.xaxis.set_minor_locator(MultipleLocator(movie_x_minor))
# -------------------------------------------------
m1 = df_.loc[df['type']=='TV Show', var_].mean()
st1 = df_.loc[df['type']=='TV Show', var_].std()
q1 = df_.loc[df['type']=='TV Show', var_].quantile(.25)
q2 = df_.loc[df['type']=='TV Show', var_].median()
q3 = df_.loc[df['type']=='TV Show', var_].quantile(.75)
#
ax3 = plt.subplot(2,2,2)
sns.histplot(data=df_.loc[df['type']=='TV Show'], x=var_, ax=ax3,
binwidth=tv_x_minor,
color='#3A5BA0', linewidth=2)
df_mean = pd.DataFrame({'x': [m1, m1], 'y': ax3.get_ybound()})
df_q1 = pd.DataFrame({'x': [q1, q1], 'y': ax3.get_ybound()})
df_q2 = pd.DataFrame({'x': [q2, q2], 'y': ax3.get_ybound()})
df_q3 = pd.DataFrame({'x': [q3, q3], 'y': ax3.get_ybound()})
sns.lineplot(data=df_mean, x='x', y='y', color='red', ax=ax3, linestyle='--',
estimator=None, linewidth = 2)
sns.lineplot(data=df_q1, x='x', y='y', color='black', ax=ax3, linestyle='--',
estimator=None, linewidth = 1)
sns.lineplot(data=df_q2, x='x', y='y', color='cyan', ax=ax3, linestyle='--',
estimator=None, linewidth = 2)
sns.lineplot(data=df_q3, x='x', y='y', color='black', ax=ax3, linestyle='--',
estimator=None, linewidth = 1)
plt.sca(ax3)
plt.title(f'{title_} for TV Show', size=16, color='grey')
plt.tick_params(
axis='x', # changes apply to the x-axis
which='both', # both major and minor ticks are affected
bottom=False, # ticks along the bottom edge are off
top=False, # ticks along the top edge are off
labelbottom=False) # labels along the bottom edge are off
plt.xlabel('')
plt.ylabel('')
plt.yticks(size=12)
sns.despine(bottom=True, left=False, trim=True, ax=ax3)
ax3.spines['left'].set_color('grey')
ax3.xaxis.set_major_locator(MultipleLocator(tv_x_major))
ax3.xaxis.set_major_formatter('{x:.0f}')
ax3.xaxis.set_minor_locator(MultipleLocator(tv_x_minor))
#
df_mean['y'] = [-0.3, 0.2]
df_q1['y'] = [0.2, 0.25]
df_q2['y'] = [0.1, 0.25]
df_q3['y'] = [0.2, 0.25]
ax4 = plt.subplot(2,2,4, sharex=ax3)
sns.boxplot(data=df_.loc[df['type']=='TV Show'], x=var_,
color='#3A5BA0', showmeans=True, ax=ax4,
# color='white',
flierprops={"marker": "x"}, medianprops={"color": "cyan"},
width=0.4, fliersize=1, linewidth=2, notch=True)
sns.lineplot(data=df_mean, x='x', y='y', color='red', ax=ax4, linestyle='--',
estimator=None, linewidth = 2)
text = f' μ={m1:.1f}\n σ={st1:.1f}'
ax4.annotate(text, xy=(m1, -0.3), rotation=90)
sns.lineplot(data=df_q1, x='x', y='y', color='black', ax=ax4, linestyle='--',
estimator=None, linewidth = 1)
text = f'Q1={q1:.1f} '
ax4.annotate(text, xy=(q1-0.1, 0.25), rotation=90, va='top', ha='right')
sns.lineplot(data=df_q2, x='x', y='y', color='cyan', ax=ax4, linestyle='--',
estimator=None, linewidth = 2)
text = f'med={q2:.1f} '
ax4.annotate(text, xy=(q2, 0.25), rotation=90, va='top', ha='center')
sns.lineplot(data=df_q3, x='x', y='y', color='black', ax=ax4, linestyle='--',
estimator=None, linewidth = 1)
text = f'Q3={q3:.1f} '
ax4.annotate(text, xy=(q3+0.1, 0.25), rotation=90, va='top', ha='left')
plt.sca(ax4)
plt.xlabel(xlabel_, size=12)
plt.ylabel('')
plt.xticks(size=12)
sns.despine(bottom=False, left=False, ax=ax4)
ax4.spines['bottom'].set_color('grey')
ax4.spines['left'].set_color('grey')
ax4.xaxis.set_major_locator(MultipleLocator(tv_x_major))
ax4.xaxis.set_major_formatter('{x:.0f}')
ax4.xaxis.set_minor_locator(MultipleLocator(tv_x_minor))
ax4.figure.set_size_inches(16,10)
plt.subplots_adjust(top=0.95, right=0.86, hspace=0, wspace=0.1)
plt.show()
plot_hist_box(var_='duration', title_='Distribution of Duration', xlabel_='Duration', movie_x_major=100, movie_x_minor=20, tv_x_major=5, tv_x_minor=1)
plot_hist_box(var_='release_year', title_='Distribution of Release Year', xlabel_='Release Year', movie_x_major=20, movie_x_minor=5, tv_x_major=20, tv_x_minor=5)
plot_hist_box(var_='date_added', title_='Distribution of Date Added', xlabel_='Date Added', movie_x_major=5, movie_x_minor=1, tv_x_major=5, tv_x_minor=1, is_date=True)
df_numeric = df[['show_id', 'type', 'duration', 'date_added', 'release_year']].drop_duplicates()
df_numeric['date_added'] = df_numeric['date_added'].dt.year
sns.pairplot(data=df_numeric, hue='type', hue_order=['Movie', 'TV Show'], palette=['#DC3535', '#3A5BA0'])
plt.show()
sns.heatmap(data=df_numeric.loc[df['type']=='Movie'].corr(), annot=True, cmap='Blues')
plt.show()
sns.heatmap(data=df_numeric.loc[df['type']=='TV Show'].corr(), annot=True, cmap='Blues')
plt.show()
Show_id:
Type:
Title:
Description:
Rating:
Director:
Cast:
Actor & Director:
Country:
Genre:
Duration:
Release Year:
Date Added:
Contry Wise Insights:
US
India
UK
Canada
France
Japan
South Korea
Generic Insights
Additional County Specific Insights
Business Recommendation